WEEK 7: WRITING FUNCTIONS

Thursday, February 23rd

Today we will…

  • Mini lecture on text material
    • Function Basics
    • Environments & Scope
    • Lab 7 Preview
  • Work time:
    • PA 7: Functions
    • Lab 7: Functions & Fish
    • Challenge 7: Incorporating Multiple Inputs

Why write functions?

  • Functions allow you to automate common tasks

  • We’ve been using functions since Day 1

  • Did you ever find yourself copy-pasting an analysis and changing small parts?

Writing your OWN functions has 3 big advantages over copy-and-paste:

  1. Your code is easier to read

  2. To change your analysis, simply change the function

  3. No more mistakes from copy-paste

Function Basics

Function Syntax


A (very) simple function

add_two <- function(x) {
  
  x + 2
  
}


Let’s call the function!

add_two(5)
[1] 7

Naming functions – add_two <-

The name of the function is chosen by the author.

add_two <- function(x) {
  
  x + 2
  
}

A word of caution: Function names have no inherent meaning.

add_three <- function(x) {
  
  x + 7
  
}

What did you expect?

add_three(5)
[1] 12

Function Arguments

The argument(s) of the function are chosen by the author.

add_two <- function(x) {
  
  x + 2
  
}

  • Arguments are temporary variables with general names

  • x, y, z – vectors

  • df – dataframe

  • i, j – indices

We can supply a default argument value – something

add_something <- function(x, something = 2) {
  
  return(x + something)
  
}

something defaults to 2

add_something(x = 5)
[1] 7
add_something(x = 5, something = 6)
[1] 11

If you do not supply a default value, the argument is required:

add_something <- function(x, something) {
  
  x + something
  
}

add_something(x = 2)
Error in add_something(x = 2): argument "something" is missing, with no default

Function { body }

The body of the function is where the action happens.

add_two <- function(x) {
  
  x + 2
  
}

return()

Your function will “give back” whatever would normally “print out”.

add_two <- function(x) {
  
  x + 2
  
}


7 + 2
[1] 9
add_two(7)
[1] 9

Explicit return()s

I prefer to have explicit returns in my functions!

add_two <- function(x) {
  
  return(x + 2)
  
}

Why? It makes debugging easier.

Input validation

add_something <- function(x, something) {
  
  stopifnot(is.numeric(x))
  
  x + something
  
}


add_something(x = "statistics", something = 5)
Error in add_something(x = "statistics", something = 5): is.numeric(x) is not TRUE

add_something <- function(x, something) {
  
  if (!is.numeric(x)) {
    stop("Please provide a numeric input for the x argument.")
  }
  
  x + something
  
}

add_something(x = "statistics", something = 5)
Error in add_something(x = "statistics", something = 5): Please provide a numeric input for the x argument.

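Recall De Morgan’s law: “not (A and B)” is the same as “(not A) or (not B)”. With several conditions, stopifnot() stops as soon as any one of them is not TRUE.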

add_something <- function(x, something) {
  
  stopifnot(is.numeric(x), is.numeric(something))
  
  x + something
  
}

add_something(x = 2, something = "R")
Error in add_something(x = 2, something = "R"): is.numeric(something) is not TRUE
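To make the De Morgan connection concrete, here is a minimal sketch that writes the same validation two ways. The helper names check_with_stopifnot() and check_with_if() are hypothetical and only used for illustration.

```r
# Hypothetical helpers: the same validation written two ways.
check_with_stopifnot <- function(x, something) {
  # Passes only if BOTH conditions are TRUE
  stopifnot(is.numeric(x), is.numeric(something))
  TRUE
}

check_with_if <- function(x, something) {
  # De Morgan: !(A && B) is the same as (!A || !B)
  if (!is.numeric(x) || !is.numeric(something)) {
    stop("Both x and something must be numeric.")
  }
  TRUE
}

# Valid inputs pass both checks
check_with_stopifnot(2, 5)
check_with_if(2, 5)

# Invalid inputs are rejected by both checks
try(check_with_stopifnot(2, "R"))
try(check_with_if(2, "R"))
```

The if() version takes more typing, but lets you write your own error message.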

Environments

Environments

  • The top right pane of your RStudio shows your environment.
  • This is the “current state” of the objects you’ve created.

  • The code inside the function executes in the function environment.
  • It does not change your global environment.

Dynamic Lookup

If an object doesn’t exist in the function’s environment, R next searches the environment where the function was defined (here, the global environment); if the object is not found anywhere along the way, R throws an error.


add_two <- function() {
  
  x + 2
  
}

add_two()
Error in add_two(): object 'x' not found

x <- 10

add_two()
[1] 12

Name Masking

Objects you make in the function don’t affect “real life”.

add_two <- function(x) {
  
  my_result <- x + 2
  
  my_result
  
}


my_result <- 2000

This is an example of name masking, where names defined inside of a function mask names defined outside of a function.

add_two(5)
[1] 7
my_result
[1] 2000

Debugging

The faces of debugging (by Allison Horst)

Debugging Strategies

  • Interactive coding (highlight individual lines within your function to run them independently of the rest)

  • print() Debugging

  • Rubber Ducking
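As a rough illustration of print() debugging, the sketch below adds temporary print() calls to a copy of add_two(). The name add_two_debug() is hypothetical; once the function behaves, the print() calls should be removed.

```r
# Hypothetical example: print() debugging inside a function.
add_two_debug <- function(x) {
  
  print(class(x))   # what type of object did we actually receive?
  
  result <- x + 2
  
  print(result)     # inspect the intermediate value before returning
  
  return(result)
  
}

add_two_debug(5)
```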

In general…

  1. Write a simple example once (without a function)

  2. Generalize by assigning variables.

  3. Write into a function.

  4. Call the function on desired arguments

Example – find_car_make()

Write a function called find_car_make() that takes as input the name of a car, and returns only the “make”, or the company that created the car.

Tip

For example, find_car_make("Toyota Camry") should return "Toyota" and find_car_make("Ford Anglia") should return "Ford".

library(stringr)

make <- str_extract(string = "Toyota Camry", 
                    pattern = "[:alpha:]*"
                    )
make
[1] "Toyota"

car_name <- "Toyota Camry"

make <- str_extract(string = car_name, 
                    pattern = "[:alpha:]*"
                    )
make
[1] "Toyota"

find_car_make <- function(car_name){
  
  make <- str_extract(string = car_name, 
                      pattern = "[:alpha:]*"
                      )
  return(make)
  
}

find_car_make("Toyota Camry")
[1] "Toyota"
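Because str_extract() is vectorized, find_car_make() also accepts a whole vector of car names and returns one make per element. That is what lets it work on a data frame column. A minimal sketch, repeating the function so the block runs on its own:

```r
library(stringr)

find_car_make <- function(car_name) {
  # str_extract() returns one result per element of car_name
  str_extract(string = car_name, pattern = "[:alpha:]*")
}

# A character vector in, a character vector of the same length out
find_car_make(c("Toyota Camry", "Mazda RX4", "Datsun 710"))
```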

Calling functions in data sets

Consider mtcars

data(mtcars)
head(mtcars, n = 3)
               mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1


Let’s use our function to create a new column in the data called make that gives the make of each car!

rownames_to_column() ❤️


library(tidyverse)  # provides rownames_to_column() and mutate()

mtcars |> 
  rownames_to_column("make_model") |> 
  mutate(make = find_car_make(make_model),
         .after = make_model
         ) |> 
  head(n = 3)
     make_model   make  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1     Mazda RX4  Mazda 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 Mazda RX4 Wag  Mazda 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3    Datsun 710 Datsun 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

Lab 7 & Challenge 7

Lab 7: Functions & Fish

Challenge 7: Incorporating Multiple Inputs

Standardizing Between 0 and 1

std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  
  num <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)
  
  num / denom
}

Could our function be more efficient?
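One possible answer: std_to_01() computes the minimum twice. The sketch below uses range() to get the minimum and maximum in a single call. The name std_to_01_range() is hypothetical, and this is only one of several reasonable ways to tidy the function.

```r
# Hypothetical variant of std_to_01(): get min and max with one call.
std_to_01_range <- function(var) {
  stopifnot(is.numeric(var))
  
  rng <- range(var, na.rm = TRUE)   # c(min, max), ignoring missing values
  
  (var - rng[1]) / (rng[2] - rng[1])
}

std_to_01_range(c(2, 4, 6, NA))
```

Missing values stay missing in the result, just as in the original function.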

Ugh. Still copy + pasting!

penguins |> 
  mutate(bill_length_mm    = std_to_01(bill_length_mm), 
         bill_depth_mm     = std_to_01(bill_depth_mm), 
         flipper_length_mm = std_to_01(flipper_length_mm), 
         body_mass_g       = std_to_01(body_mass_g)
  )

Recall across()!

penguins |> 
  mutate(across(.cols = bill_length_mm:body_mass_g,
                ~ std_to_01(.x)
                )
  )

New option: Variables as Arguments!

std_column_01 <- function(data, variable) {
  
  stopifnot(is.data.frame(data), 
            is.numeric(variable)
            )
  
  data <- data |> 
    mutate(variable = std_to_01(variable))
  
  data
  
}

Note

Notice how I relied on the existing function std_to_01() inside the new function, for clarity!

But it didn’t work…

std_column_01(penguins, body_mass_g)
This call errors: R tries to evaluate body_mass_g as an ordinary object, but it exists only as a column of penguins.

Tidy evaluation

Functions that accept unquoted variable names as arguments rely on non-standard evaluation, which the tidyverse calls tidy evaluation.

Tidy:

penguins |> 
  pull(body_mass_g)


penguins$body_mass_g

Untidy:

penguins[, "body_mass_g"]


penguins[["body_mass_g"]]

Solution 1 🤷

  • Just don’t use tidy evaluations in your functions
  • Harder to read / use, but safe

std_column_01 <- function(data, variable) {
  
  stopifnot(is.data.frame(data), 
            is.character(variable))
  
  data[[variable]] <- std_to_01(data[[variable]])
  
  data
}

std_column_01(penguins, "bill_length_mm")

Solution 2 – Embrace Injection library(rlang)

In June 2019 (version 0.4.0), rlang introduced the “injection” {{ }} operator to simplify writing functions around tidyverse pipelines.

With the {{ }} operator you can pass data-variables (i.e. columns of a data frame) as function arguments and inject them into the tidyverse functions called inside your function!

Warning

This only works inside functions that accept a literal (tidy) variable name to refer to a column of the data, such as select() or mutate().

Recall Our Broken Function

std_column_01 <- function(data, variable) {
  
  data <- data |> 
    mutate(
      variable = std_to_01(variable)
    )
  
  data
  
}

std_column_01(penguins, body_mass_g)
Error in `mutate()`:
ℹ In argument: `variable = std_to_01(variable)`.
Caused by error in `stopifnot()`:
! object 'body_mass_g' not found

  • The problem here is that mutate() defuses the R code it was supplied, so it sees variable = std_to_01(variable) literally.
  • Instead we want it to see body_mass_g = std_to_01(body_mass_g).

This is why we need injection!
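“Defusing” means capturing the code of an argument instead of running it. rlang has its own tools for this. As a rough base R analogue (an assumption chosen for illustration, not what mutate() literally does internally), substitute() shows the idea. The function name show_defused() is hypothetical.

```r
# Base R sketch: substitute() captures the argument's code, not its value.
show_defused <- function(variable) {
  substitute(variable)
}

# body_mass_g does not exist as an object, yet this does not error,
# because the argument is never evaluated.
show_defused(body_mass_g)
```

The result is the name body_mass_g itself, stored as code that can be evaluated later inside a data frame.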

{{ variable }}

std_column_01 <- function(data, variable) {
  
  stopifnot(is.data.frame(data))
  
  data <- data |>
    mutate({{ variable }} = std_to_01( {{ variable }})
           )
  data
}

Error: <text>:6:27: unexpected '='
5:   data <- data |>
6:     mutate({{ variable }} =
                             ^

Danger

Oh no! What happened?

The left-hand side of = must be a plain name, so R cannot even parse {{ variable }} = ...!

The “Walrus Operator” :=

The “walrus operator” := is an alias of =.

You can use it to supply names, e.g. a := b is equivalent to a = b. Unlike =, it allows {{ }} on its left-hand side.

std_column_01 <- function(data, variable) {
  
  stopifnot(is.data.frame(data))
  
  data <- data |>
    mutate({{ variable }} := std_to_01( {{ variable }})
           )
  data
  
}
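To check that the := version behaves as intended, here is a small self-contained sketch. The data frame toy and its columns are made up for illustration, and std_to_01() is repeated so the block runs on its own.

```r
library(dplyr)

std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  
  (var - min(var, na.rm = TRUE)) /
    (max(var, na.rm = TRUE) - min(var, na.rm = TRUE))
}

std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))
  
  data |>
    mutate({{ variable }} := std_to_01({{ variable }}))
}

# Toy data (made up for illustration)
toy <- data.frame(id = 1:3, mass = c(10, 20, 40))

# The column name is passed unquoted, like in any tidyverse function
std_column_01(toy, mass)
```

Only the mass column is rescaled; id is left untouched.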

Don’t forget about across()

What if I want to modify multiple columns?

std_column_01 <- function(data, variables) {
  
  stopifnot(is.data.frame(data))
  
  data <- data |> 
    mutate(across(.cols = {{ variables }}, 
                  ~ std_to_01(.x)
                  )
           )
  
  data
  
}


std_column_01(penguins, bill_length_mm:body_mass_g)

Missing Data are important!

Assumptions when removing missing data

Without inspection:

Observations are “missing completely at random”

With information about the “missingness”:

Observations are “missing at random”

Look for patterns!

Missing data – Example

If fish length measurements are missing at random, conditional on month, year, and river section,

then the distributions of lengths will be similar for fish of the same month, year, and river section.
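One way to look for patterns is to count missing values within each group. The sketch below uses a made-up data frame; the name fish and its columns are assumptions for illustration and are not the Lab 7 data.

```r
library(dplyr)

# Made-up data, for illustration only
fish <- data.frame(
  year    = c(1989, 1989, 1989, 1990, 1990, 1990),
  section = c("A", "A", "B", "A", "B", "B"),
  length  = c(210, NA, 185, NA, NA, 240)
)

# Count missing lengths within each year and river section
missing_summary <- fish |>
  group_by(year, section) |>
  summarize(n_missing = sum(is.na(length)),
            n_total   = n(),
            .groups   = "drop")

missing_summary
```

If the missing counts cluster in particular years or sections, the data are probably not missing completely at random.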

Scaling Variables

Why Scale?

  • Easier to compare across variables

  • Easier to model (standardizes variance)

Why not Scale?

  • Difficult to interpret

Interesting reads

Article on How Building Functions with Variable Names has Changed Over the Years

rlang Article on Data Masking